Instructor demonstration points to hit

  1. Ensure students install rvest package and load it.

Chapter 15

Task 1: Load packages

Task Run the following code to load packages.

library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

For this example, we’re going to work on loading a simple table of data from the Bureau of Labor Statistics. This is a table of industry sectors (each with a two-digit NAICS code) that we could make use of in our analysis of PPP loan data.

Recall that our PPP loan data has six-digit NAICS codes for each industry, which allows us to identify the industry for each loan. For example 212221 is the code for “Gold Mining Industry”.

A six-digit NAICS code is the most specific. As we remove numbers from the right to create five-digit, four-digit, three-digit and two-digit codes, the industries they represent get broader. Here’s an example:

  • 212221 - Gold Mining Industry
  • 2122 - Metal Ore Mining Industry (which includes gold mining and things like silver mining, iron mining and copper mining)
  • 21 - Mining, Quarrying, and Oil and Gas Extraction Industry (which contains the metal mining industries mentioned above, but also oil drilling, coal mining and more).

It might be useful to have a lookup table of those top-level, two-digit NAICS codes (also called sector codes) for our analysis, to help us answer questions about what specific top-level industries got loans. L

Let’s suppose we can’t find a table like that for download, but we do see a version on the BLS website at this URL: https://www.bls.gov/ces/naics/.

Task 2: Go to the web page linked above

Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage includes a brief history of industry classification as well as a table for NAICS sectors. It displays the sector number and a description.

Task 3: Load image

Task Run the following code to display an image showing what you should see on the web page.

We could get this table into R with the following manual steps: highlighting the text, copying it into Excel, saving it as a csv, and reading it into R.

Or, we could write a few lines of webscraping code to have R do that for us!

In this simple example, it’s probably faster to do it manually than have R do it for us. And this table is unlikely to change much in the future.

Why would we ever write code to grab a single table? There’s several reasons:

  1. Our methods are transparent. If a colleague wants to run our code from scratch to factcheck our work, they don’t need to repeat the manual steps, which are harder to document than writing code.
  2. Let’s suppose we wanted to grab the same table every day, to monitor for changes (like, say, a table on a health department website that has COVID case numbers that update every day). Writing a script once, and pressing a single button every day is going to be much more efficient than doing this manually every day.
  3. If we’re doing it manually, we’re more likely to make a mistake, like maybe failing to copy every row from the whole table.
  4. It’s good practice to prepare us to do more complex scraping jobs. As we’ll see in the next chapter, if we ever want to grab the same table from hundreds of pages, writing code is much faster and easier than going to a hundred different pages ourselves and downloading data.

So, to scrape, the first thing we need to do is start with the URL. Let’s store it as an object called naics_url.

Task 4: Store URL

Task Run the following code to store the URL.

naics_url <- "https://www.bls.gov/ces/naics/"

When we go to the web page, we can see a nicely-designed page that contains our information.

But what we really care about, for our purposes, is the html code that creates that page.

In our web browser, if we right-click anywhere on the page and select “view source” from the popup menu, we can see the source code. Or you can just copy this into Google Chrome: view-source:https://www.bls.gov/ces/naics/.

Task 3: View source

Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer I see the source code. I can see the different ‘class’ tags and what the coders decided to label them as. I can also see comment code, which, similar to how it works in R, act as a guideline when writing code.

Here’s a picture of what some of the source code looks like.

Task 4: Load image

Task Run the following code to display an image showing what you should see on the web page. We’ll use those HTML tags – things like <div> and <a> and <table> – to grab the info we need.

Okay, step 1.

Let’s write a bit of code to tell R to go to the URL for the page and ingest all of that HTML code. In the code below, we’re starting with our URL and using the read_html() function from rvest to ingest all of the page html, storing it as an object called naics_industry.

Task 5: Run code to read in html

Task Run the following code to read in the html. Briefly describe the output that appears below the codeblock. Answer The output displays the html code that we read in. It shows the head and body tags as well as what language the code’s in.

# read in the html
naics_industry <- naics_url %>%
  read_html()

# display the html below
naics_industry
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="layout-fixed">\n\t\t\t\t<section aria-label="content"><!-- P ...

If you’re running this code in R Studio, in our environment window at right, you’ll see naics_industry as a “list of 2”.

This is not a dataframe, it’s a different type of data structure a “nested list.”

If we click on the name “naics_industry” in our environment window, we can see that it’s pulled in the html and shown us the general page structure.

Task 6: Examine html

Task Follow the directions in the previous paragraph to click on naics_industry in the environment window. Briefly describe what you see there. Answer I see the name, type, and category of the name “naics_industry”. Types include list and character, and values include “list of length 2” and ‘en-us’.

Nested within the <html> tag is the <head> and <body>, the two fundamental sections of most web pages. We’re going to pull information out of the <body> tag in a bit.

Task 7: Load image

Task Run the following code to display an image showing what you should see when clicking on naics in the environment window.

Now, our task is to just pull out the section of the html that contains the information we need.

But which part do we need from that mess of html code? To figure that out, we can go back to the page in a web browser like chrome, and use built in developer tools to “inspect” the html code underlying the page.

On the page, find the data we want to grab – “Table 2. NAICS Sectors” - and right click on the word “Sector” in the column header of the table. That will bring up a dropdown menu. Select “Inspect”, which will pop up a window called the “element inspector” that shows us where different elements on the page are located, what html tags created those elements, and other info.

Task 8: Inspect element

Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer I see the tag for the table and the other elements that go inside of it, such as classes and various headings.

Task 9: Load image

Task Run the following code to display an image showing what you should see on the web page.

The entire table that we want of naics sectors is actually contained inside an html <table>. It has a header row <thead> that contains the column names and a <tbody> that contains one row <tr> per industry sector code.

Because it’s inside of a table, and not some other kind of element (like a <div>), rvest has a special function for easily extracting and converting html tables, called html_table(). This function extracts all six html tables on the page, only one of which we actually want.

Task 10: Run code to process tables

Task Run the following code to process tables. Briefly describe the output that appears below the codeblock. Answer The output produced the tables from the website, including conversion implementation dates, sectors, and divisons.

# read in the html and extract all the tables
naics_industry <- naics_url %>%
  read_html() %>%
  html_table()

# display the tables below
naics_industry
## [[1]]
## # A tibble: 4 × 3
##   Conversion               `Reference Month Implemented` `Date Released` 
##   <chr>                    <chr>                         <chr>           
## 1 SIC to NAICS 2002        May 2003                      June 6, 2003    
## 2 NAICS 2002 to NAICS 2007 January 2008                  February 1, 2008
## 3 NAICS 2007 to NAICS 2012 January 2012                  February 3, 2012
## 4 NAICS 2012 to NAICS 2017 January 2018                  February 2, 2018
## 
## [[2]]
## # A tibble: 21 × 2
##    Sector Description                                  
##    <chr>  <chr>                                        
##  1 11     Agriculture, Forestry, Fishing and Hunting   
##  2 21     Mining, Quarrying, and Oil and Gas Extraction
##  3 22     Utilities                                    
##  4 23     Construction                                 
##  5 31-33  Manufacturing                                
##  6 42     Wholesale Trade                              
##  7 44-45  Retail Trade                                 
##  8 48-49  Transportation and Warehousing               
##  9 51     Information                                  
## 10 52     Finance and Insurance                        
## # … with 11 more rows
## 
## [[3]]
## # A tibble: 11 × 2
##    Division                                      Description                    
##    <chr>                                         <chr>                          
##  1 A                                             Agriculture, Forestry, And Fis…
##  2 B                                             Mining                         
##  3 C                                             Construction                   
##  4 D                                             Manufacturing                  
##  5 E                                             Transportation, Communications…
##  6 F                                             Wholesale Trade                
##  7 G                                             Retail Trade                   
##  8 H                                             Finance, Insurance, And Real E…
##  9 I                                             Services                       
## 10 J                                             Public Administration          
## 11 Source: www.osha.gov/pls/imis/sic_manual.html Source: www.osha.gov/pls/imis/…
## 
## [[4]]
## # A tibble: 6 × 6
##   SIC              SIC     SIC        NAICS          NAICS  NAICS   
##   <chr>            <chr>   <chr>      <chr>          <chr>  <chr>   
## 1 "Level"          "Code1" "Example2" Level          Code1  Example2
## 2 "Division"       "Alpha" "D"        Sector         XX     31      
## 3 "Major Group"    "XX"    "20"       Subsector      XXX    311     
## 4 "Industry Group" "XXX"   "203"      Industry Group XXXX   3114    
## 5 "Industry"       "XXXX"  "2037"     NAICS Industry XXXXX  31141   
## 6 ""               ""      ""         U.S. Industry  XXXXXX 311411  
## 
## [[5]]
## # A tibble: 4 × 4
##   Positions `Example Value` `Field Name`             `All Possible Values`    
##   <chr>     <chr>           <chr>                    <chr>                    
## 1 1-2       CE              Survey Abbreviation      CE                       
## 2 3         U               Seasonal Adjustment Code S,U                      
## 3 4-11      31335311        Industry Code            00000000 through 90932999
## 4 12-13     01              Data Type Code           01 through 99            
## 
## [[6]]
## # A tibble: 4 × 4
##   Positions `Example Value` `Field Name`             `All Possible Values`
##   <chr>     <chr>           <chr>                    <chr>                
## 1 1-2       EE              Survey Abbreviation      EE                   
## 2 3         S               Seasonal Adjustment Code S,U                  
## 3 4-9       000000          Industry Code            000000 through 959000
## 4 10-11     01              Data Type Code           01 through 83

In the environment window at right, look at naics_industry. Note that it’s now a “list of 6”.

Click on it to open it up.

Task 11: Inspect naics_industry.

Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer I see a list of 6 that includes name, type, and value. The value and list columns include how long each table is.

It should look like this.

Task 12: Load image

Task Run the following code to display an image. This gets a little complicated, but what you’re seeing here is a nested list that contains six different data frames – also called tibbles – one for each table that exists on the web page we scraped.

They’re numbered 1 to 6. The first 1 has 4 rows and 3 columns, the second has 21 rows and 2 columns, and so on.

To examine what’s in each dataframe, mouse over the right edge (next to the word columns) on each row, and click the little scroll icon. The icon will be hidden until you mouse over it.

Click on the scroll icon for the first dataframe examine it.

Task 13: Click scroll icon

Task Follow the directions in the previous paragraphs. Briefly describe what you see there. Answer Clicking on the scroll icon shows you the full table in a new window. The first dataframe is the conversion table.

Task 14: Load image

Task Run the following code to display an image showing what you should see when you follow the steps in task 13.

That’s not the one we want!

Let’s try clicking on the scroll icon for row 2.

Task 15: Click scroll icon

Task Follow the directions in the previous paragraph. Briefly describe what you see there. Answer The scroll opened the second dataframe which is the sector and its description.

Task 16: Load image

Task Task Run the following code to display an image showing what you should see when you follow the steps in task 15.

That’s more like it! So, all we need to do now is to store that single dataframe as an object, and get rid of the rest. We can do that with this code, which says "keep only the second dataframe from our nested list. If we wanted to keep the third one, we’d change the number 2 to number 3.

Task 17: Run code to keep only table we want

Task Run the following code to keep only the table we want. Briefly describe the output that appears below the codeblock. Answer The output produces only the sector table.

# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry  <- naics_url %>%
  read_html() %>%
  html_table()

# Just keep the second dataframe in our list

naics_industry <- naics_industry[[2]]

# show the dataframe

naics_industry
## # A tibble: 21 × 2
##    Sector Description                                  
##    <chr>  <chr>                                        
##  1 11     Agriculture, Forestry, Fishing and Hunting   
##  2 21     Mining, Quarrying, and Oil and Gas Extraction
##  3 22     Utilities                                    
##  4 23     Construction                                 
##  5 31-33  Manufacturing                                
##  6 42     Wholesale Trade                              
##  7 44-45  Retail Trade                                 
##  8 48-49  Transportation and Warehousing               
##  9 51     Information                                  
## 10 52     Finance and Insurance                        
## # … with 11 more rows

We now have a proper dataframe.

From here, we can do a little light cleaning. Let’s use clean_names() to standardize the column names. Then let’s use slice() to remove the last row – row number 21 – which contains source information that will complicate our use of this table later.

Task 18: Run code to do some light cleaning

Task Run the following code to do light cleaning. Briefly describe the output that appears below the codeblock. Answer It’s a more clean and concise sector table. You don’t have to scroll left and right to see the different column headings.

# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry <- naics_url %>%
  read_html() %>%
  html_table()

# Just keep the second dataframe in our list, standardize column headers, remove last row

naics_industry <- naics_industry[[2]] %>%
  clean_names() %>%
  slice(-21)

# show the dataframe
naics_industry
## # A tibble: 20 × 2
##    sector description                                                           
##    <chr>  <chr>                                                                 
##  1 11     Agriculture, Forestry, Fishing and Hunting                            
##  2 21     Mining, Quarrying, and Oil and Gas Extraction                         
##  3 22     Utilities                                                             
##  4 23     Construction                                                          
##  5 31-33  Manufacturing                                                         
##  6 42     Wholesale Trade                                                       
##  7 44-45  Retail Trade                                                          
##  8 48-49  Transportation and Warehousing                                        
##  9 51     Information                                                           
## 10 52     Finance and Insurance                                                 
## 11 53     Real Estate and Rental and Leasing                                    
## 12 54     Professional, Scientific, and Technical Services                      
## 13 55     Management of Companies and Enterprises                               
## 14 56     Administrative and Support and Waste Management and Remediation Servi…
## 15 61     Educational Services                                                  
## 16 62     Health Care and Social Assistance                                     
## 17 71     Arts, Entertainment, and Recreation                                   
## 18 72     Accommodation and Food Services                                       
## 19 81     Other Services (except Public Administration)                         
## 20 92     Public Administration

And there we go. We now have a nice tidy dataframe of NAICS sector codes.

In the next chapter, we’ll look at a more complicated example.

Chapter 16

In the last chapter, we demonstrated a fairly straightforward example of web scraping to grab a list of NAICS industry sector codes from the BLS website.

We’re going to graduate to a more challenging example, one that will help us gather information about the number of employees in each industry sector.

What makes this more challenging? Well, the information we need is all contained on multiple pages, one page per sector. We need to write code to visit each page, and then merge them into a single data frame. This is challenging stuff, so don’t feel dissuaded if it all doesn’t click the first time through. Like many things, web scraping is something that gets easier with lots of practice.

First we start with libraries, as we always do.

Task 1: Load packages

Task Run the following code to load packages.

library(rvest)
library(tidyverse)
library(janitor)

Now, let’s run the code we wrote in the last chapter, to get a tidy list of NAICS sector codes and names from https://www.bls.gov/ces/naics/.

Task 2: Run code to load the table from the last chapter

Task Run the following code to load table from last chapter. Briefly describe the output that appears below the codeblock. Answer The output produces the clean sector table from above.

# Define url of page we want to scrape

naics_url <- "https://www.bls.gov/ces/naics/"

# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry  <- naics_url %>%
  read_html() %>%
  html_table()

# Just keep the second dataframe in our list, standardize column headers, remove last row

naics_industry <- naics_industry[[2]] %>%
  clean_names() %>%
  slice(-21)

# show the dataframe
naics_industry
## # A tibble: 20 × 2
##    sector description                                                           
##    <chr>  <chr>                                                                 
##  1 11     Agriculture, Forestry, Fishing and Hunting                            
##  2 21     Mining, Quarrying, and Oil and Gas Extraction                         
##  3 22     Utilities                                                             
##  4 23     Construction                                                          
##  5 31-33  Manufacturing                                                         
##  6 42     Wholesale Trade                                                       
##  7 44-45  Retail Trade                                                          
##  8 48-49  Transportation and Warehousing                                        
##  9 51     Information                                                           
## 10 52     Finance and Insurance                                                 
## 11 53     Real Estate and Rental and Leasing                                    
## 12 54     Professional, Scientific, and Technical Services                      
## 13 55     Management of Companies and Enterprises                               
## 14 56     Administrative and Support and Waste Management and Remediation Servi…
## 15 61     Educational Services                                                  
## 16 62     Health Care and Social Assistance                                     
## 17 71     Arts, Entertainment, and Recreation                                   
## 18 72     Accommodation and Food Services                                       
## 19 81     Other Services (except Public Administration)                         
## 20 92     Public Administration

We’ll use this table to help us get to our end goal: a single dataframe with the number of employees in each industry sector.

It will look like this when we’re done.

Task 3: Load image

Task Run the following code to display an image showing what you should see on the web page.

Unfortunately, that information doesn’t exist in a single tidy table on a single page we can scrape all at once. We’re going to have to scrape it from lots of different pages, and build it ourselves.

Let’s next take a look at the web page that has detailed employment information for one of our sectors, 22, Mining, Quarrying, and Oil and Gas Extraction.

We can find it here: https://www.bls.gov/iag/tgs/iag22.htm.

Task 4: Go to the web page linked above

Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage is about the utilities sector. It has tables that show workforce statistics including employment numbers and unemployment rates.

A few scrolls down the page, there’s a table that has employee statistics.

The table is called “Employment and Unemployment”. There’s a row in the tabled for “Employment, all employees (seasonally adjusted)”. And in that row, there’s a value for the number of employees – in thousands – in June 2021.

The table shows that for the mining sector, it was 538.6 – or 538,600 – in June 2021. That’s the value we want to ingest in R.

Task 5: Load image

Task Run the following code to display an image showing what you should see on the web page.

We don’t just want it for mining. We want it for all sectors!

But we’ll start by writing code just to get it from this one sector page, then modify that code to get it from every sector’s page

First, let’s define the URL of the page we want to get the information from.

Task 6: Store URL

Task Run the following code to store the URL.

# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"

Next, let’s read in the html of that page, and store it as an object called employment_info.

Task 7: Run code to read in html

Task Run the following code to read in the html. Briefly describe the output that appears below the codeblock. Answer The employment html is displayed below the codeblock. We see the head and body tags.

# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"

# Get employment html
employment_info <- url %>%
  read_html()  

# Display it so we can see what it looks like
employment_info
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n<!-- ****************************************** Begin HEADER ** ...

Now, let’s set to picking out the information we need from the raw html.

We can use the web inspector in our web browser (Chrome) to figure out where the table is located.

Go to the web page and right click on the word “Data Series” in the table, then pick “inspect” to pull up the menu.

Notice two things. First, all of this information is contained in a proper html <table>. And that table has an id property of “iag22emp1”. Designers use these IDs to help style the page, to target certain elements with CSS. We can use it to scrape.

Task 8: Load image

Task Run the following code to display an image showing what you should see on the web page. Recall that in the last chapter, when we used the html_table() function, it pulled in every single table on the page, six in total.

Here, we can use that id property to pick out just the table we want, and leave all the others behind.

We do that with a new function from rvest called html_element(), employing a bit of information about that element stored in what’s called the xpath. Xpath is a query language that helps us write programs that target specific parts of web pages.

The syntax is a little unwieldy, I know.

But essentially what the html_element function says is “find the html element that has an id of iag22emp1, using the xpath method, and get rid of all other elements”.

Task 9: Run code to get html_element with info we need

Task Run the following code to get html_element with info we need. Briefly describe the output that appears below the codeblock. Answer We now see the caption, thead, tbody, and tfoot tags in the output as opposed to the regular head and body tags. We isolated the table.

# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"

# Get employment html page and select only the table with employment information
employment_info <- url %>%
  read_html() %>%
  html_element(xpath = '//*[@id="iag22emp1"]')

# Display it so we can see what it looks like
employment_info
## {html_node}
## <table class="regular" id="iag22emp1">
## [1] <caption><span class="tableTitle"></span></caption>
## [2] <thead><tr>\n<th class="stubhead" id="iag22emp1.h.1.1">Data series</th>\r ...
## [3] <tbody>\n<tr>\n<th headers="iag22emp1.h.1.1" id="iag22emp1.r.1"><p class= ...
## [4] <tfoot><tr class="footnotes">\n<td class="footnotes" colspan="6">\r\n\t<p ...

We’ve now isolated the table on the page that contains the information we need, and gotten rid of everything else.

From here, we can use the html_tables() function to transform it from messy html code to a proper dataframe.

Task 10: Run code to convert to table

Task Run the following code to convert to table. Briefly describe the output that appears below the codeblock. Answer The data series table appears.

# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag22.htm"

# Get employment html page and select only the table with employment information, then transform it from html to a table.
employment_info <- url %>%
  read_html() %>%
  html_element(xpath = '//*[@id="iag22emp1"]') %>%
  html_table()

# Display it so we can see what it looks like
employment_info
## # A tibble: 6 × 6
##   `Data series`          Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##   <chr>                  <chr>       <chr>      <chr>      <chr>      <chr>     
## 1 Employment (in thousa… ""          ""         ""         ""         ""        
## 2 Employment, all emplo… ""          "538.6"    "538.9"    "(p)538.1" "(p)537.6"
## 3 Employment, productio… ""          "428.1"    "429.1"    "(p)427.4" "(p)427.1"
## 4 Unemployment           ""          ""         ""         ""         ""        
## 5 Unemployment rate      ""          "5.1%"     "3.4%"     "2.7%"     "3.1%"    
## 6 Footnotes(p) Prelimin… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…

Now we have a proper dataframe of 6 rows and 6 columns.

It has much more information than we need, so let’s clean it up to isolate only the “Employment, all employees (seasonally adjusted)” value for June 2021.

Use clean_names() to standardize the column names, use slice() to keep only the second row, and use select() to keep two columns data_series and jun_2021.

Task 11: Run code to keep row 2 and light cleaning

Task Run the following code to keep row 2 and light cleaning. Briefly describe the output that appears below the codeblock. Answer We have a cleaner table with just the row we want (the value for June 2021). It’s much easier to read.

# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag21.htm"

# Get employment html page and select only the table with employment information, then transform it from html to a table.
employment_info <- url %>%
  read_html() %>%
  html_element(xpath = '//*[@id="iag21emp1"]') %>%
  html_table()

# Keep only second row with seasonally adjusted, bind back to each_row_df
employment_info <- employment_info %>%
  clean_names() %>%
  slice(2) %>%
  select(data_series, jun_2021)

# Display it so we can see what it looks like
employment_info
## # A tibble: 1 × 2
##   data_series                                      jun_2021
##   <chr>                                            <chr>   
## 1 Employment, all employees  (seasonally adjusted) 586.7

Okay, so we’ve successfully obtained the employment numbers for one of our sectors. That’s great.

But remember our original charge: to get a table with employment numbers for ALL sectors, not just one.

This is a little tricky, because, remember, the information for each sector is on a different page!

The info for mining is on this page: https://www.bls.gov/iag/tgs/iag21.htm.

Task 12: Go to the web page linked above

Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage has information about the mining sector including a table on workforce statistics, showing employment/unemployment.

The info for construction is on this page: https://www.bls.gov/iag/tgs/iag23.htm.

Task 13: Go to the web page linked above

Task Visit the web page linked above. Briefly describe what you see there. Answer There is a table on workforce statistics, showing employment/unemployment/job openings.

We have 20 sectors to get through.

We could get the info we need by copying the codeblock we just wrote 20 times, and change the url at the top each time.

But that’s not a great approach.

What if we needed to change the code? We’d need to change it 20 times! In programming, there’s a principle called “DRY” which stands for “Don’t Repeat Yourself”. If you find yourself copying the same code over and over again, with minor changes, it’s better to find a way to avoid that.

Using for loops

Fortunately, there’s a programming paradigm called “iteration” that is helpful here, using a method called a “for loop”.

Every programming language has its own version of a “for loop”, and R is no different.

A “for loop” says: “let’s take a list of things, and do the same thing to each item on that list.”

Let’s look at a very simple example to help illustrate the values of for loops.

We’re going to write code to print out 10 industry sectors.

First, let’s do it the repetitive way. We’re writing the same print function over and over, just changing the sector name each time.

Task 14: Run code to print sectors

Task Run the following code to print sectors. Briefly describe the output that appears below the codeblock. Answer It creates a list of the new sector names.

print("Agriculture, Forestry, Fishing and Hunting")
## [1] "Agriculture, Forestry, Fishing and Hunting"
print("Mining, Quarrying, and Oil and Gas Extraction")
## [1] "Mining, Quarrying, and Oil and Gas Extraction"
print("Utilities")
## [1] "Utilities"
print("Construction")
## [1] "Construction"
print("Manufacturing")
## [1] "Manufacturing"
print("Wholesale Trade")
## [1] "Wholesale Trade"
print("Retail Trade")
## [1] "Retail Trade"
print("Transportation and Warehousing")
## [1] "Transportation and Warehousing"
print("Information")
## [1] "Information"
print("Finance and Insurance")
## [1] "Finance and Insurance"

We repeated print() 10 times, with minor modifications each time. Lots of repetition, which we seek to avoid if possible.

Now let’s look at how we might do that a little more efficiently with a “for loop.”

First let’s make a list of sectors, and save it as an object called “list_of_sectors.” The c() function tells R that we’re making a list.

Task 15: Run code to make list of sectors

Task Run the following code to make list of sectors. Briefly describe the output that appears below the codeblock. Answer Nothing appears below the code block. The list of sectors goes into the environment.

list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")

And now let’s write a “for loop” to print out sector on that list.

Task 16: Run code to use for loop to print sectors

Task Run the following code to use a for loop to print list of sectors. Briefly describe the output that appears below the codeblock. Answer It shows the print list of sectors.

# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")

# Make a for loop and run it
for (sector in list_of_sectors) {
  print(sector)
}
## [1] "Agriculture, Forestry, Fishing and Hunting"
## [1] "Mining, Quarrying, and Oil and Gas Extraction"
## [1] "Utilities"
## [1] "Construction"
## [1] "Manufacturing"
## [1] "Wholesale Trade"
## [1] "Retail Trade"
## [1] "Transportation and Warehousing"
## [1] "Information"
## [1] "Finance and Insurance"

That’s many fewer lines of code. Let’s break down what we just saw, starting with for (sector in list_of sectors).

The information inside the parentheses tells R what list to use – list_of_sectors – and how to identify list elements later on – sector.

It’s important that the thing on the right side of “in” use the exact name of the list we want to loop through – in this case “list_of_sectors”.

If we try to feed it something different – say “sector_list” – it won’t work, because our actual list is called something else – “list_of_sectors”. This code throws an error.

Task 17: Run code to use for loop to print sectors that errors

Task Run the following code to use a for loop to print list of sectors that errors. Briefly describe the output that appears below the codeblock. Answer It shows an error. The object ‘sector-list’ wasn’t found.

# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")

# For loop that refers to a list that doesn't exist!  
for (sector in sector_list) {
  print(sector)
}
## Error in eval(expr, envir, enclos): object 'sector_list' not found

The name on the left side of “in” – the word we’re assigning to represent each element – is totally arbitrary.

We could use any character string, even something simple like “x”.

What matters is that we use the same character string inside of the curly braces {}, the section of the “for loop” that tells R what to do to each element – in this case, print it out.

To illustrate this, note that the code works just fine if we change it to say this:

Task 18: Run code to use for loop to print sectors

Task Run the following code to use a for loop to print list of sectors. Briefly describe the output that appears below the codeblock. Answer The code printed the list of sectors.

# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")

# For loop with x that stands in for each element in our list, instead of sector
for (x in list_of_sectors) {
  print(x)
}
## [1] "Agriculture, Forestry, Fishing and Hunting"
## [1] "Mining, Quarrying, and Oil and Gas Extraction"
## [1] "Utilities"
## [1] "Construction"
## [1] "Manufacturing"
## [1] "Wholesale Trade"
## [1] "Retail Trade"
## [1] "Transportation and Warehousing"
## [1] "Information"
## [1] "Finance and Insurance"

But it does NOT work if we call each element one thing – x – in the first line of our “for loop”, and use a different name to refer to it inside of the curly braces.

In this code below, it has no idea what we mean by “sector_name”, because we haven’t defined that anywhere.

Task 19: Run code to use for loop to print sectors that errors

Task Run the following code to use a for loop to print list of sectors. Briefly describe the output that appears below the codeblock. Answer It shows an error, as the object ‘sector name’ cannot be found.

# Define list of sectors
list_of_sectors <- c("Agriculture, Forestry, Fishing and Hunting", "Mining, Quarrying, and Oil and Gas Extraction", "Utilities", "Construction", "Manufacturing",
"Wholesale Trade", "Retail Trade", "Transportation and Warehousing", "Information", "Finance and Insurance")

# For loop that includes instructions that refer to a variable that doesn't exist.
for (x in list_of_sectors) {
  print(sector_name)
}
## Error in print(sector_name): object 'sector_name' not found

We can also write for loops to iterate over a range of numbers, instead of a list of characters. The syntax is a little different.

The code below says: “for each number in a range of numbers from 1 to 10, print the number.”

Task 20: Run code to use for loop to print 1 to 10

Task Run the following code to use a for loop to print 1 to 10. Briefly describe the output that appears below the codeblock. Answer The code printed the numbers 1-10.

# For loop that includes instructions that refer to a variable that doesn't exist.
for (number in 1:10) {
  print(number)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Here’s a minor variation on that approach that we’ll make use of below.

Instead of giving the for loop an explicit number range, like 1:10, we can tell it to use 1 to “the number of rows in a dataframe” as our list of things to loop through.

Remember the naics_industry dataframe we loaded first? It has 20 rows. ### Task 21: Run code to display naics_industry Task Run the following code to display naics_industry. Briefly describe the output that appears below the codeblock. Answer It shows the naics_industry table.

naics_industry
## # A tibble: 20 × 2
##    sector description                                                           
##    <chr>  <chr>                                                                 
##  1 11     Agriculture, Forestry, Fishing and Hunting                            
##  2 21     Mining, Quarrying, and Oil and Gas Extraction                         
##  3 22     Utilities                                                             
##  4 23     Construction                                                          
##  5 31-33  Manufacturing                                                         
##  6 42     Wholesale Trade                                                       
##  7 44-45  Retail Trade                                                          
##  8 48-49  Transportation and Warehousing                                        
##  9 51     Information                                                           
## 10 52     Finance and Insurance                                                 
## 11 53     Real Estate and Rental and Leasing                                    
## 12 54     Professional, Scientific, and Technical Services                      
## 13 55     Management of Companies and Enterprises                               
## 14 56     Administrative and Support and Waste Management and Remediation Servi…
## 15 61     Educational Services                                                  
## 16 62     Health Care and Social Assistance                                     
## 17 71     Arts, Entertainment, and Recreation                                   
## 18 72     Accommodation and Food Services                                       
## 19 81     Other Services (except Public Administration)                         
## 20 92     Public Administration

We can use that information in our for loop by using the nrow() function, which calculates the number of rows in a dataframe. Here’s a quick demonstration of how that works.

Task 22: Run code to display number of rows in naics_industry

Task Run the following code to display number of rows naics_industry. Briefly describe the output that appears below the codeblock. Answer The code shows that there are 20 rows in naics_industry.

nrow(naics_industry)
## [1] 20

To put it all together, the code below says “make a list of numbers that starts at 1 and ends at the number of rows in the naics_industry dataframe (which is 20), then print out each of these numbers.”

Task 23: Run code to print each row number in naics_industry

Task Run the following code to to print each row number in naics_industry. Briefly describe the output that appears below the codeblock. Answer The output shows a rprint of the list of 20 rows in naics_industry,

# For loop that includes instructions that refer to a variable that doesn't exist.
for (row_number in 1:nrow(naics_industry)) {
  print(row_number)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20

These were basic examples of how “for loops” work. Next, we’ll learn to apply “for loops” to efficentily extract information from multiple web pages.

Looping and rvest

First, let’s look at the codeblock we wrote earlier to extract the number of employees in the mining sector.

Task 24: Run code to load table from earlier

Task Run the following code to load table from earlier. Briefly describe the output that appears below the codeblock. Answer It shows the number of employees in the mining sector in June 2021.

# Define url of the page we want to get
url <- "https://www.bls.gov/iag/tgs/iag21.htm"

# Get employment html page and select only the table with employment information, then transform it from html to a table.
employment_info <- url %>%
  read_html() %>%
  html_element(xpath = '//*[@id="iag21emp1"]') %>%
  html_table()

# Keep only second row with seasonally adjusted, bind back to each_row_df
employment_info <- employment_info %>%
  clean_names() %>%
  slice(2) %>%
  select(data_series, jun_2021)

# Display it so we can see what it looks like
employment_info
## # A tibble: 1 × 2
##   data_series                                      jun_2021
##   <chr>                                            <chr>   
## 1 Employment, all employees  (seasonally adjusted) 586.7

This contains all the steps we needed to extract the information from one sector page. We’re now going to modify this function so we can use it to extract information from each sector page, writing code that keeps us from repeating ourselves too much.

First, we need to build a list of URLs to loop through in a “for loop.” We can do that using the dataframe we made in the last chapter. ### Task 25: Run code to load the table from the last chapter Task Run the following code to load table from last chapter. Briefly describe the output that appears below the codeblock. Answer It shows the sector table from the last chapter.

# Define url of page we want to scrape

naics_url <- "https://www.bls.gov/ces/naics/"

# Read in all html from table, store all tables on page as nested list of dataframes.
naics_industry  <- naics_url %>%
  read_html() %>%
  html_table()

# Just keep the second dataframe in our list, standardize column headers, remove last row

naics_industry <- naics_industry[[2]] %>%
  clean_names() %>%
  slice(-21)

# show the dataframe
naics_industry
## # A tibble: 20 × 2
##    sector description                                                           
##    <chr>  <chr>                                                                 
##  1 11     Agriculture, Forestry, Fishing and Hunting                            
##  2 21     Mining, Quarrying, and Oil and Gas Extraction                         
##  3 22     Utilities                                                             
##  4 23     Construction                                                          
##  5 31-33  Manufacturing                                                         
##  6 42     Wholesale Trade                                                       
##  7 44-45  Retail Trade                                                          
##  8 48-49  Transportation and Warehousing                                        
##  9 51     Information                                                           
## 10 52     Finance and Insurance                                                 
## 11 53     Real Estate and Rental and Leasing                                    
## 12 54     Professional, Scientific, and Technical Services                      
## 13 55     Management of Companies and Enterprises                               
## 14 56     Administrative and Support and Waste Management and Remediation Servi…
## 15 61     Educational Services                                                  
## 16 62     Health Care and Social Assistance                                     
## 17 71     Arts, Entertainment, and Recreation                                   
## 18 72     Accommodation and Food Services                                       
## 19 81     Other Services (except Public Administration)                         
## 20 92     Public Administration

This gives us the sector code and name for each industry.

Now let’s have a look at the URLs for a few of the pages we want to grab data from.

Notice a pattern?

They all start with “https://www.bls.gov/iag/tgs/iag”. The next bit of information is different for each one; with the two-digit sector code for each sector. The remainder is identical in all three links, “.htm”.

Because they’re all the same, we can use the information in the dataframe we just loaded to make all the URLs we need.

We’re going to use mutate() and paste0() to concatenate (mash together) the things that stay constant in every url (the beginning and end) with the things that are different (the sector number, stored in the column called sector).

Task 26: Run code to build url

Task Run the following code to build url. Briefly describe the output that appears below the codeblock. Answer It shows us the same table, but with an added sector_url column.

# Make a column with URL for each sector.
naics_industry <- naics_industry %>%
  mutate(sector_url = paste0("https://www.bls.gov/iag/tgs/iag",sector,".htm"))

# Display it
naics_industry
## # A tibble: 20 × 3
##    sector description                                  sector_url               
##    <chr>  <chr>                                        <chr>                    
##  1 11     Agriculture, Forestry, Fishing and Hunting   https://www.bls.gov/iag/…
##  2 21     Mining, Quarrying, and Oil and Gas Extracti… https://www.bls.gov/iag/…
##  3 22     Utilities                                    https://www.bls.gov/iag/…
##  4 23     Construction                                 https://www.bls.gov/iag/…
##  5 31-33  Manufacturing                                https://www.bls.gov/iag/…
##  6 42     Wholesale Trade                              https://www.bls.gov/iag/…
##  7 44-45  Retail Trade                                 https://www.bls.gov/iag/…
##  8 48-49  Transportation and Warehousing               https://www.bls.gov/iag/…
##  9 51     Information                                  https://www.bls.gov/iag/…
## 10 52     Finance and Insurance                        https://www.bls.gov/iag/…
## 11 53     Real Estate and Rental and Leasing           https://www.bls.gov/iag/…
## 12 54     Professional, Scientific, and Technical Ser… https://www.bls.gov/iag/…
## 13 55     Management of Companies and Enterprises      https://www.bls.gov/iag/…
## 14 56     Administrative and Support and Waste Manage… https://www.bls.gov/iag/…
## 15 61     Educational Services                         https://www.bls.gov/iag/…
## 16 62     Health Care and Social Assistance            https://www.bls.gov/iag/…
## 17 71     Arts, Entertainment, and Recreation          https://www.bls.gov/iag/…
## 18 72     Accommodation and Food Services              https://www.bls.gov/iag/…
## 19 81     Other Services (except Public Administratio… https://www.bls.gov/iag/…
## 20 92     Public Administration                        https://www.bls.gov/iag/…

While we’re at it, we’re going to use the same method to programatically build the “xpath” for the table on each sector page.

Recall that when we wrote our function that got information from just the mining page, the xpath targeted an element with an ID of “iag21emp1”. Why 21? That’s the sector code for mining.

If we look for that exact element ID on other sector pages, we won’t find it! That’s because it’s different for each page.

On the Utilities page (sector code 22), the ID for the table we want is “iag22emp1”. On the Construction page (sector code 23), it’s “iag23emp1”. We can also build this programatically, because it follows a predictable pattern.

Task 27: Run code to build id

Task Run the following code to build id. Briefly describe the output that appears below the codeblock. Answer Now we have a sector_xpath_id column in the same table as above.

# Make a column with URL and xpath ID for each sector
naics_industry <- naics_industry %>%
  mutate(sector_url = paste0("https://www.bls.gov/iag/tgs/iag",sector,".htm")) %>%
  mutate(sector_xpath_id =paste0("iag",sector,"emp1"))

# Display it
naics_industry
## # A tibble: 20 × 4
##    sector description                       sector_url           sector_xpath_id
##    <chr>  <chr>                             <chr>                <chr>          
##  1 11     Agriculture, Forestry, Fishing a… https://www.bls.gov… iag11emp1      
##  2 21     Mining, Quarrying, and Oil and G… https://www.bls.gov… iag21emp1      
##  3 22     Utilities                         https://www.bls.gov… iag22emp1      
##  4 23     Construction                      https://www.bls.gov… iag23emp1      
##  5 31-33  Manufacturing                     https://www.bls.gov… iag31-33emp1   
##  6 42     Wholesale Trade                   https://www.bls.gov… iag42emp1      
##  7 44-45  Retail Trade                      https://www.bls.gov… iag44-45emp1   
##  8 48-49  Transportation and Warehousing    https://www.bls.gov… iag48-49emp1   
##  9 51     Information                       https://www.bls.gov… iag51emp1      
## 10 52     Finance and Insurance             https://www.bls.gov… iag52emp1      
## 11 53     Real Estate and Rental and Leasi… https://www.bls.gov… iag53emp1      
## 12 54     Professional, Scientific, and Te… https://www.bls.gov… iag54emp1      
## 13 55     Management of Companies and Ente… https://www.bls.gov… iag55emp1      
## 14 56     Administrative and Support and W… https://www.bls.gov… iag56emp1      
## 15 61     Educational Services              https://www.bls.gov… iag61emp1      
## 16 62     Health Care and Social Assistance https://www.bls.gov… iag62emp1      
## 17 71     Arts, Entertainment, and Recreat… https://www.bls.gov… iag71emp1      
## 18 72     Accommodation and Food Services   https://www.bls.gov… iag72emp1      
## 19 81     Other Services (except Public Ad… https://www.bls.gov… iag81emp1      
## 20 92     Public Administration             https://www.bls.gov… iag92emp1

Lastly, we’re going to use filter to remove the “Public Administration” sector, because there’s no page for it. We’ll have to get that information some other way.

Task 28: Run code to filter table

Task Run the following code to filter table. Briefly describe the output that appears below the codeblock. Answer The output shows the same table as above, just without the “Public Administration” sector.

# Make a column with URL and xpath ID for each sector, remove the Public Administration sector
naics_industry <- naics_industry %>%
  mutate(sector_url = paste0("https://www.bls.gov/iag/tgs/iag",sector,".htm")) %>%
  mutate(sector_xpath_id =paste0("iag",sector,"emp1")) %>%
  filter(description != "Public Administration")
# Display it
naics_industry
## # A tibble: 19 × 4
##    sector description                       sector_url           sector_xpath_id
##    <chr>  <chr>                             <chr>                <chr>          
##  1 11     Agriculture, Forestry, Fishing a… https://www.bls.gov… iag11emp1      
##  2 21     Mining, Quarrying, and Oil and G… https://www.bls.gov… iag21emp1      
##  3 22     Utilities                         https://www.bls.gov… iag22emp1      
##  4 23     Construction                      https://www.bls.gov… iag23emp1      
##  5 31-33  Manufacturing                     https://www.bls.gov… iag31-33emp1   
##  6 42     Wholesale Trade                   https://www.bls.gov… iag42emp1      
##  7 44-45  Retail Trade                      https://www.bls.gov… iag44-45emp1   
##  8 48-49  Transportation and Warehousing    https://www.bls.gov… iag48-49emp1   
##  9 51     Information                       https://www.bls.gov… iag51emp1      
## 10 52     Finance and Insurance             https://www.bls.gov… iag52emp1      
## 11 53     Real Estate and Rental and Leasi… https://www.bls.gov… iag53emp1      
## 12 54     Professional, Scientific, and Te… https://www.bls.gov… iag54emp1      
## 13 55     Management of Companies and Ente… https://www.bls.gov… iag55emp1      
## 14 56     Administrative and Support and W… https://www.bls.gov… iag56emp1      
## 15 61     Educational Services              https://www.bls.gov… iag61emp1      
## 16 62     Health Care and Social Assistance https://www.bls.gov… iag62emp1      
## 17 71     Arts, Entertainment, and Recreat… https://www.bls.gov… iag71emp1      
## 18 72     Accommodation and Food Services   https://www.bls.gov… iag72emp1      
## 19 81     Other Services (except Public Ad… https://www.bls.gov… iag81emp1

We’re left with a dataframe of 19 rows and 4 columns. It now contains everything we need.

Next, we’ll construct a “for loop” to extract the info we need from each page. We’re going to build it up step-by-step, beginning with the the basic elements of our “for loop”.

The codeblock below says: “Make a list with the row numbers from 1 to the number of rows in our naics_industry dataframe (which is 19). Then, for each element of that list (1, 2, 3, 4, 5 and so on up to 19), use slice() to keep only the one row that matches that number and save this newly created dataframe as each_row_df. Print out the dataframe. Then go to the next element on the list and do the same thing. Keep doing that until we hit number 19, then stop.”

We get 19 dataframes, each with one row, one for each sector.

Task 29: Run code to run for loop and keep one row

Task Run the following for loop and keep one row. Briefly describe the output that appears below the codeblock. Answer The output shows tables for each row that matches its respective number.

# For loop, iterating over each row in our naics industry dataframe

for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # To help us see what's happening as we build this, we're going to print the thing we're creating.  
    print(each_row_df)

}
## # A tibble: 1 × 4
##   sector description                                sector_url   sector_xpath_id
##   <chr>  <chr>                                      <chr>        <chr>          
## 1 11     Agriculture, Forestry, Fishing and Hunting https://www… iag11emp1      
## # A tibble: 1 × 4
##   sector description                                   sector_url sector_xpath_id
##   <chr>  <chr>                                         <chr>      <chr>          
## 1 21     Mining, Quarrying, and Oil and Gas Extraction https://w… iag21emp1      
## # A tibble: 1 × 4
##   sector description sector_url                            sector_xpath_id
##   <chr>  <chr>       <chr>                                 <chr>          
## 1 22     Utilities   https://www.bls.gov/iag/tgs/iag22.htm iag22emp1      
## # A tibble: 1 × 4
##   sector description  sector_url                            sector_xpath_id
##   <chr>  <chr>        <chr>                                 <chr>          
## 1 23     Construction https://www.bls.gov/iag/tgs/iag23.htm iag23emp1      
## # A tibble: 1 × 4
##   sector description   sector_url                               sector_xpath_id
##   <chr>  <chr>         <chr>                                    <chr>          
## 1 31-33  Manufacturing https://www.bls.gov/iag/tgs/iag31-33.htm iag31-33emp1   
## # A tibble: 1 × 4
##   sector description     sector_url                            sector_xpath_id
##   <chr>  <chr>           <chr>                                 <chr>          
## 1 42     Wholesale Trade https://www.bls.gov/iag/tgs/iag42.htm iag42emp1      
## # A tibble: 1 × 4
##   sector description  sector_url                               sector_xpath_id
##   <chr>  <chr>        <chr>                                    <chr>          
## 1 44-45  Retail Trade https://www.bls.gov/iag/tgs/iag44-45.htm iag44-45emp1   
## # A tibble: 1 × 4
##   sector description                    sector_url               sector_xpath_id
##   <chr>  <chr>                          <chr>                    <chr>          
## 1 48-49  Transportation and Warehousing https://www.bls.gov/iag… iag48-49emp1   
## # A tibble: 1 × 4
##   sector description sector_url                            sector_xpath_id
##   <chr>  <chr>       <chr>                                 <chr>          
## 1 51     Information https://www.bls.gov/iag/tgs/iag51.htm iag51emp1      
## # A tibble: 1 × 4
##   sector description           sector_url                            sector_xpath_id
##   <chr>  <chr>                 <chr>                                 <chr>          
## 1 52     Finance and Insurance https://www.bls.gov/iag/tgs/iag52.htm iag52emp1      
## # A tibble: 1 × 4
##   sector description                        sector_url           sector_xpath_id
##   <chr>  <chr>                              <chr>                <chr>          
## 1 53     Real Estate and Rental and Leasing https://www.bls.gov… iag53emp1      
## # A tibble: 1 × 4
##   sector description                                      sector_url sector_xpath_id
##   <chr>  <chr>                                            <chr>      <chr>          
## 1 54     Professional, Scientific, and Technical Services https://w… iag54emp1      
## # A tibble: 1 × 4
##   sector description                             sector_url      sector_xpath_id
##   <chr>  <chr>                                   <chr>           <chr>          
## 1 55     Management of Companies and Enterprises https://www.bl… iag55emp1      
## # A tibble: 1 × 4
##   sector description                        sector_url           sector_xpath_id
##   <chr>  <chr>                              <chr>                <chr>          
## 1 56     Administrative and Support and Wa… https://www.bls.gov… iag56emp1      
## # A tibble: 1 × 4
##   sector description          sector_url                            sector_xpath_id
##   <chr>  <chr>                <chr>                                 <chr>          
## 1 61     Educational Services https://www.bls.gov/iag/tgs/iag61.htm iag61emp1      
## # A tibble: 1 × 4
##   sector description                       sector_url            sector_xpath_id
##   <chr>  <chr>                             <chr>                 <chr>          
## 1 62     Health Care and Social Assistance https://www.bls.gov/… iag62emp1      
## # A tibble: 1 × 4
##   sector description                         sector_url          sector_xpath_id
##   <chr>  <chr>                               <chr>               <chr>          
## 1 71     Arts, Entertainment, and Recreation https://www.bls.go… iag71emp1      
## # A tibble: 1 × 4
##   sector description                     sector_url              sector_xpath_id
##   <chr>  <chr>                           <chr>                   <chr>          
## 1 72     Accommodation and Food Services https://www.bls.gov/ia… iag72emp1      
## # A tibble: 1 × 4
##   sector description                                   sector_url sector_xpath_id
##   <chr>  <chr>                                         <chr>      <chr>          
## 1 81     Other Services (except Public Administration) https://w… iag81emp1

We’re almost to the part where we can go out and fetch the html we need. Before we do that, let’s store as part of our loop an object called “url”, which contains the URL of the page for each sector.

The syntax with the dollar sign is a little funky, but “each_row_df$sector_url” says “from the each_row_df dataframe, grab the information in the sector_url column.” Because the column has only one row, there’s one value.

We’re going to do something simliar with the xpath for our employment table by using the information in the sector_xpath_id column.

That code also looks a little unwieldly. Recall that the xpath for the mining industry was '//*[@id="iag22emp1"]'.

In the code below, we’re building the xpath dynamically by pasting together the parts that stay the same for each xpath – '//*[@id="' and '"]' – and the parts that change for each sector, pulled from the xpath_sector_id column.

To see how this is working, we’re going to edit our print statement at the end a bit, printing the row_number and the dynamically created url and xpath.

Task 30: Run code to run for loop to store url and xpath value

Task Run the following for loop to store url and xpath value. Briefly describe the output that appears below the codeblock. Answer The output stored url and xpath value in the environment. It printed the row number and the url and xpath.

# For loop, iterating over each row in our naics industry dataframe

for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # Define url of page to get
    url <- each_row_df$sector_url

    # Define id of table to ingest
    xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')

    # To help us see what's happening as we build this, we're going to print the thing we're creating.  
    print(paste0("ROW NUMBER:", row_number," URL: ",url," XPATH:",xpath_employment_table))

}
## [1] "ROW NUMBER:1 URL: https://www.bls.gov/iag/tgs/iag11.htm XPATH://*[@id=\"iag11emp1\"]"
## [1] "ROW NUMBER:2 URL: https://www.bls.gov/iag/tgs/iag21.htm XPATH://*[@id=\"iag21emp1\"]"
## [1] "ROW NUMBER:3 URL: https://www.bls.gov/iag/tgs/iag22.htm XPATH://*[@id=\"iag22emp1\"]"
## [1] "ROW NUMBER:4 URL: https://www.bls.gov/iag/tgs/iag23.htm XPATH://*[@id=\"iag23emp1\"]"
## [1] "ROW NUMBER:5 URL: https://www.bls.gov/iag/tgs/iag31-33.htm XPATH://*[@id=\"iag31-33emp1\"]"
## [1] "ROW NUMBER:6 URL: https://www.bls.gov/iag/tgs/iag42.htm XPATH://*[@id=\"iag42emp1\"]"
## [1] "ROW NUMBER:7 URL: https://www.bls.gov/iag/tgs/iag44-45.htm XPATH://*[@id=\"iag44-45emp1\"]"
## [1] "ROW NUMBER:8 URL: https://www.bls.gov/iag/tgs/iag48-49.htm XPATH://*[@id=\"iag48-49emp1\"]"
## [1] "ROW NUMBER:9 URL: https://www.bls.gov/iag/tgs/iag51.htm XPATH://*[@id=\"iag51emp1\"]"
## [1] "ROW NUMBER:10 URL: https://www.bls.gov/iag/tgs/iag52.htm XPATH://*[@id=\"iag52emp1\"]"
## [1] "ROW NUMBER:11 URL: https://www.bls.gov/iag/tgs/iag53.htm XPATH://*[@id=\"iag53emp1\"]"
## [1] "ROW NUMBER:12 URL: https://www.bls.gov/iag/tgs/iag54.htm XPATH://*[@id=\"iag54emp1\"]"
## [1] "ROW NUMBER:13 URL: https://www.bls.gov/iag/tgs/iag55.htm XPATH://*[@id=\"iag55emp1\"]"
## [1] "ROW NUMBER:14 URL: https://www.bls.gov/iag/tgs/iag56.htm XPATH://*[@id=\"iag56emp1\"]"
## [1] "ROW NUMBER:15 URL: https://www.bls.gov/iag/tgs/iag61.htm XPATH://*[@id=\"iag61emp1\"]"
## [1] "ROW NUMBER:16 URL: https://www.bls.gov/iag/tgs/iag62.htm XPATH://*[@id=\"iag62emp1\"]"
## [1] "ROW NUMBER:17 URL: https://www.bls.gov/iag/tgs/iag71.htm XPATH://*[@id=\"iag71emp1\"]"
## [1] "ROW NUMBER:18 URL: https://www.bls.gov/iag/tgs/iag72.htm XPATH://*[@id=\"iag72emp1\"]"
## [1] "ROW NUMBER:19 URL: https://www.bls.gov/iag/tgs/iag81.htm XPATH://*[@id=\"iag81emp1\"]"

Armed with the URL and xpath for each sector web page, we can now go out and get the employment table for each sector.

We’ll read in the html from the url we just stored; extract the table that has the xpath ID we just created; and then transform the html table code into a proper dataframe.

The dataframe is hidden inside a nested list, which we’ll have to extract in the next step.

So, when you run this code, it will print out 19 dataframes inside of nested lists, each containing one dataframe.

Task 31: Run code to run for loop to get tables

Task Run the following for loop to get tables. Briefly describe the output that appears below the codeblock. Answer The output shows 19 tables and it also printed the data series table.

# For loop, iterating over each row in our naics industry dataframe

for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # Define url of page to get
    url <- each_row_df$sector_url

    # Define id of table to ingest
    xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')

    # Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table(). The dataframe is in a nested list, which we'll have to extract in the next step.
    employment_info <- url %>%
      read_html() %>%
      html_elements(xpath = xpath_employment_table) %>%
      html_table()

    # To help us see what's happening as we build this, we're going to print the thing we're creating.  
    print(employment_info)


}
## [[1]]
## # A tibble: 2 × 6
##   `Data series`     Backdata Jun.2021 Jul.2021 Aug.2021 Sep.2021
##   <chr>             <lgl>    <chr>    <chr>    <chr>    <chr>   
## 1 Unemployment      NA       ""       ""       ""       ""      
## 2 Unemployment rate NA       "7.5%"   "7.6%"   "5.7%"   "5.0%"  
## 
## [[1]]
## # A tibble: 6 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "586.7"     "592.5"     "(p)598.3"  "(p)603.0"
## 3 Employment, produc… ""          "427.6"     "433.6"     "(p)438.8"  ""        
## 4 Unemployment        ""          ""          ""          ""          ""        
## 5 Unemployment rate   ""          "10.3%"     "8.9%"      "10.2%"     "7.3%"    
## 6 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 6 × 6
##   `Data series`          Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##   <chr>                  <chr>       <chr>      <chr>      <chr>      <chr>     
## 1 Employment (in thousa… ""          ""         ""         ""         ""        
## 2 Employment, all emplo… ""          "538.6"    "538.9"    "(p)538.1" "(p)537.6"
## 3 Employment, productio… ""          "428.1"    "429.1"    "(p)427.4" "(p)427.1"
## 4 Unemployment           ""          ""         ""         ""         ""        
## 5 Unemployment rate      ""          "5.1%"     "3.4%"     "2.7%"     "3.1%"    
## 6 Footnotes(p) Prelimin… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`         Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##    <chr>                 <chr>       <chr>      <chr>      <chr>      <chr>     
##  1 Employment (in thous… ""          ""         ""         ""         ""        
##  2 Employment, all empl… ""          "7,413"    "7,425"    "(p)7,425" "(p)7,447"
##  3 Employment, producti… ""          "5,494"    "5,517"    "(p)5,510" "(p)5,518"
##  4 Unemployment          ""          ""         ""         ""         ""        
##  5 Unemployment rate     ""          "7.5%"     "6.1%"     "4.6%"     "4.5%"    
##  6 Job openings, hires,… ""          ""         ""         ""         ""        
##  7 Job openings          ""          "344"      "377"      "(p)365"   ""        
##  8 Hires                 ""          "441"      "430"      "(p)381"   ""        
##  9 Separations           ""          "352"      "402"      "(p)431"   ""        
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`         Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##    <chr>                 <chr>       <chr>      <chr>      <chr>      <chr>     
##  1 Employment (in thous… ""          ""         ""         ""         ""        
##  2 Employment, all empl… ""          "12,332"   "12,389"   "(p)12,42… "(p)12,44…
##  3 Employment, producti… ""          "8,558"    "8,617"    "(p)8,641" "(p)8,655"
##  4 Unemployment          ""          ""         ""         ""         ""        
##  5 Unemployment rate     ""          "5.4%"     "4.2%"     "3.6%"     "3.9%"    
##  6 Job openings, hires,… ""          ""         ""         ""         ""        
##  7 Job openings          ""          "937"      "974"      "(p)892"   ""        
##  8 Hires                 ""          "554"      "526"      "(p)515"   ""        
##  9 Separations           ""          "426"      "452"      "(p)518"   ""        
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`         Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##    <chr>                 <chr>       <chr>      <chr>      <chr>      <chr>     
##  1 Employment (in thous… ""          ""         ""         ""         ""        
##  2 Employment, all empl… ""          "5,707.1"  "5,722.1"  "(p)5,718… "(p)5,735…
##  3 Employment, producti… ""          "4,534.6"  "4,547.8"  "(p)4,549… "(p)4,568…
##  4 Unemployment          ""          ""         ""         ""         ""        
##  5 Unemployment rate     ""          "4.5%"     "4.6%"     "4.3%"     "3.7%"    
##  6 Job openings, hires,… ""          ""         ""         ""         ""        
##  7 Job openings          ""          "287"      "291"      "(p)263"   ""        
##  8 Hires                 ""          "195"      "199"      "(p)201"   ""        
##  9 Separations           ""          "142"      "177"      "(p)215"   ""        
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`         Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##    <chr>                 <chr>       <chr>      <chr>      <chr>      <chr>     
##  1 Employment (in thous… ""          ""         ""         ""         ""        
##  2 Employment, all empl… ""          "15,361.4" "15,355.3" "(p)15,35… "(p)15,40…
##  3 Employment, producti… ""          "13,057.1" "13,068.6" "(p)13,07… "(p)13,11…
##  4 Unemployment          ""          ""         ""         ""         ""        
##  5 Unemployment rate     ""          "6.6%"     "6.4%"     "6.5%"     "6.1%"    
##  6 Job openings, hires,… ""          ""         ""         ""         ""        
##  7 Job openings          ""          "1,228"    "1,245"    "(p)1,324" ""        
##  8 Hires                 ""          "1,171"    "905"      "(p)987"   ""        
##  9 Separations           ""          "861"      "913"      "(p)1,097" ""        
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 6 × 6
##   `Data series`          Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##   <chr>                  <chr>       <chr>      <chr>      <chr>      <chr>     
## 1 Employment (in thousa… ""          ""         ""         ""         ""        
## 2 Employment, all emplo… ""          "5,736.8"  "5,792.9"  "(p)5,847… "(p)5,894…
## 3 Employment, productio… ""          "5,022.0"  "5,052.6"  "(p)5,074… "(p)5,102…
## 4 Unemployment           ""          ""         ""         ""         ""        
## 5 Unemployment rate      ""          "6.2%"     "7.3%"     "6.4%"     "5.7%"    
## 6 Footnotes(p) Prelimin… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`         Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##    <chr>                 <chr>       <chr>      <chr>      <chr>      <chr>     
##  1 Employment (in thous… ""          ""         ""         ""         ""        
##  2 Employment, all empl… ""          "2,726"    "2,745"    "(p)2,774" "(p)2,806"
##  3 Employment, producti… ""          "2,161"    "2,183"    "(p)2,201" "(p)2,233"
##  4 Unemployment          ""          ""         ""         ""         ""        
##  5 Unemployment rate     ""          "6.1%"     "5.6%"     "4.4%"     "4.0%"    
##  6 Job openings, hires,… ""          ""         ""         ""         ""        
##  7 Job openings          ""          "152"      "181"      "(p)161"   ""        
##  8 Hires                 ""          "96"       "106"      "(p)101"   ""        
##  9 Separations           ""          "93"       "85"       "(p)93"    ""        
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 9 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "6,534.3"   "6,545.1"   "(p)6,546.… "(p)6,541…
## 3 Unemployment        ""          ""          ""          ""          ""        
## 4 Unemployment rate   ""          "2.8%"      "3.1%"      "3.5%"      "2.0%"    
## 5 Job openings, hire… ""          ""          ""          ""          ""        
## 6 Job openings        ""          "229"       "354"       "(p)310"    ""        
## 7 Hires               ""          "156"       "164"       "(p)141"    ""        
## 8 Separations         ""          "140"       "140"       "(p)174"    ""        
## 9 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 9 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "2,271.8"   "2,291.9"   "(p)2,302.… "(p)2,308…
## 3 Unemployment        ""          ""          ""          ""          ""        
## 4 Unemployment rate   ""          "5.0%"      "2.8%"      "2.6%"      "3.8%"    
## 5 Job openings, hire… ""          ""          ""          ""          ""        
## 6 Job openings        ""          "105"       "199"       "(p)167"    ""        
## 7 Hires               ""          "84"        "89"        "(p)74"     ""        
## 8 Separations         ""          "75"        "82"        "(p)71"     ""        
## 9 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 6 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "9,791.0"   "9,839.0"   "(p)9,897.… "(p)9,952…
## 3 Employment, produc… ""          "7,513.7"   "7,593.2"   "(p)7,613.… ""        
## 4 Unemployment        ""          ""          ""          ""          ""        
## 5 Unemployment rate   ""          "3.5%"      "3.2%"      "2.7%"      "2.5%"    
## 6 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 4 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "2,329.9"   "2,329.8"   "(p)2,336.… "(p)2,339…
## 3 Employment, produc… ""          "1,533.4"   "1,526.8"   "(p)1,521.… ""        
## 4 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 4 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "8,726.6"   "8,770.3"   "(p)8,790.… "(p)8,791…
## 3 Employment, produc… ""          "7,732.8"   "7,755.2"   "(p)7,847.… ""        
## 4 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 9 × 6
##   `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021    Sep.2021  
##   <chr>               <chr>       <chr>       <chr>       <chr>       <chr>     
## 1 Employment (in tho… ""          ""          ""          ""          ""        
## 2 Employment, all em… ""          "3,535.8"   "3,576.1"   "(p)3,625.… "(p)3,606…
## 3 Unemployment        ""          ""          ""          ""          ""        
## 4 Unemployment rate   ""          "6.4%"      "6.7%"      "6.1%"      "3.9%"    
## 5 Job openings, hire… ""          ""          ""          ""          ""        
## 6 Job openings        ""          "202"       "221"       "(p)184"    ""        
## 7 Hires               ""          "134"       "143"       "(p)209"    ""        
## 8 Separations         ""          "111"       "88"        "(p)127"    ""        
## 9 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnotes… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021   Sep.2021  
##    <chr>               <chr>       <chr>       <chr>       <chr>      <chr>     
##  1 Employment (in tho… ""          ""          ""          ""         ""        
##  2 Employment, all em… ""          "20,001.2"  "20,044.8"  "(p)20,04… "(p)20,05…
##  3 Employment, produc… ""          "17,590.4"  "17,594.0"  "(p)17,61… ""        
##  4 Unemployment        ""          ""          ""          ""         ""        
##  5 Unemployment rate   ""          "3.9%"      "3.8%"      "3.9%"     "3.1%"    
##  6 Job openings, hire… ""          ""          ""          ""         ""        
##  7 Job openings        ""          "1,479"     "1,813"     "(p)1,541" ""        
##  8 Hires               ""          "713"       "763"       "(p)763"   ""        
##  9 Separations         ""          "721"       "728"       "(p)720"   ""        
## 10 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021   Sep.2021  
##    <chr>               <chr>       <chr>       <chr>       <chr>      <chr>     
##  1 Employment (in tho… ""          ""          ""          ""         ""        
##  2 Employment, all em… ""          "2,037.6"   "2,093.4"   "(p)2,142… "(p)2,185…
##  3 Employment, produc… ""          "1,927.0"   "2,031.8"   "(p)2,012… ""        
##  4 Unemployment        ""          ""          ""          ""         ""        
##  5 Unemployment rate   ""          "10.3%"     "8.3%"      "7.7%"     "7.2%"    
##  6 Job openings, hire… ""          ""          ""          ""         ""        
##  7 Job openings        ""          "223"       "234"       "(p)206"   ""        
##  8 Hires               ""          "303"       "199"       "(p)148"   ""        
##  9 Separations         ""          "99"        "119"       "(p)219"   ""        
## 10 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`       Backdata    Jun.2021    Jul.2021    Aug.2021   Sep.2021  
##    <chr>               <chr>       <chr>       <chr>       <chr>      <chr>     
##  1 Employment (in tho… ""          ""          ""          ""         ""        
##  2 Employment, all em… ""          "12,762.9"  "13,115.3"  "(p)13,10… "(p)13,13…
##  3 Employment, produc… ""          "11,417.1"  "11,759.2"  "(p)11,73… ""        
##  4 Unemployment        ""          ""          ""          ""         ""        
##  5 Unemployment rate   ""          "11.0%"     "9.2%"      "9.4%"     "7.8%"    
##  6 Job openings, hire… ""          ""          ""          ""         ""        
##  7 Job openings        ""          "1,552"     "1,793"     "(p)1,541" ""        
##  8 Hires               ""          "1,462"     "1,443"     "(p)1,141" ""        
##  9 Separations         ""          "886"       "1,005"     "(p)1,280" ""        
## 10 Footnotes(p) Preli… "Footnotes… "Footnotes… "Footnotes… "Footnote… "Footnote…
## 
## [[1]]
## # A tibble: 10 × 6
##    `Data series`         Backdata    Jun.2021   Jul.2021   Aug.2021   Sep.2021  
##    <chr>                 <chr>       <chr>      <chr>      <chr>      <chr>     
##  1 Employment (in thous… ""          ""         ""         ""         ""        
##  2 Employment, all empl… ""          "5,665"    "5,709"    "(p)5,743" "(p)5,727"
##  3 Employment, producti… ""          "4,589"    "4,642"    "(p)4,676" "(p)4,657"
##  4 Unemployment          ""          ""         ""         ""         ""        
##  5 Unemployment rate     ""          "5.4%"     "4.9%"     "5.0%"     "4.2%"    
##  6 Job openings, hires,… ""          ""         ""         ""         ""        
##  7 Job openings          ""          "431"      "501"      "(p)499"   ""        
##  8 Hires                 ""          "375"      "327"      "(p)250"   ""        
##  9 Separations           ""          "202"      "275"      "(p)234"   ""        
## 10 Footnotes(p) Prelimi… "Footnotes… "Footnote… "Footnote… "Footnote… "Footnote…

In this next step, we use employment_info <- employment_info[[1]] to extract each dataframe from the nested list. Then we’ll tidy up the dataframe a bit. We’ll use the get rid of all the information we don’t need in the table, by using slice() to keep only the second row. We’ll also standardize the column names with clean_names().

Task 32: Run code to run for loop to clean up tables

Task Run the following for loop to clean up tables. Briefly describe the output that appears below the codeblock. Answer The output shows clean tables with the second row that we wanted.

# For loop, iterating over each row in our naics industry dataframe

for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # Define url of page to get
    url <- each_row_df$sector_url

    # Define id of table to ingest
    xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')

    # Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table().  The dataframe is in a nested list, which we'll have to extract in the next step.
    employment_info <- url %>%
      read_html() %>%
      html_elements(xpath = xpath_employment_table) %>%
      html_table()

    # Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row;
    employment_info <- employment_info[[1]] %>%
      clean_names() %>%
      slice(2)

    # To help us see what's happening as we build this, we're going to print the thing we're creating.  
    print(employment_info)


}
## # A tibble: 1 × 6
##   data_series       backdata jun_2021 jul_2021 aug_2021 sep_2021
##   <chr>             <lgl>    <chr>    <chr>    <chr>    <chr>   
## 1 Unemployment rate NA       7.5%     7.6%     5.7%     5.0%    
## # A tibble: 1 × 6
##   data_series                       backdata jun_2021 jul_2021 aug_2021 sep_2021
##   <chr>                             <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 Employment, all employees  (seas… ""       586.7    592.5    (p)598.3 (p)603.0
## # A tibble: 1 × 6
##   data_series                       backdata jun_2021 jul_2021 aug_2021 sep_2021
##   <chr>                             <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 Employment, all employees  (seas… ""       538.6    538.9    (p)538.1 (p)537.6
## # A tibble: 1 × 6
##   data_series                       backdata jun_2021 jul_2021 aug_2021 sep_2021
##   <chr>                             <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 Employment, all employees  (seas… ""       7,413    7,425    (p)7,425 (p)7,447
## # A tibble: 1 × 6
##   data_series                      backdata jun_2021 jul_2021 aug_2021  sep_2021
##   <chr>                            <chr>    <chr>    <chr>    <chr>     <chr>   
## 1 Employment, all employees  (sea… ""       12,332   12,389   (p)12,420 (p)12,4…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       5,707.1  5,722.1  (p)5,718… (p)5,735…
## # A tibble: 1 × 6
##   data_series                    backdata jun_2021 jul_2021 aug_2021   sep_2021 
##   <chr>                          <chr>    <chr>    <chr>    <chr>      <chr>    
## 1 Employment, all employees  (s… ""       15,361.4 15,355.3 (p)15,351… (p)15,40…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       5,736.8  5,792.9  (p)5,847… (p)5,894…
## # A tibble: 1 × 6
##   data_series                       backdata jun_2021 jul_2021 aug_2021 sep_2021
##   <chr>                             <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 Employment, all employees  (seas… ""       2,726    2,745    (p)2,774 (p)2,806
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       6,534.3  6,545.1  (p)6,546… (p)6,541…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       2,271.8  2,291.9  (p)2,302… (p)2,308…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       9,791.0  9,839.0  (p)9,897… (p)9,952…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       2,329.9  2,329.8  (p)2,336… (p)2,339…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       8,726.6  8,770.3  (p)8,790… (p)8,791…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       3,535.8  3,576.1  (p)3,625… (p)3,606…
## # A tibble: 1 × 6
##   data_series                    backdata jun_2021 jul_2021 aug_2021   sep_2021 
##   <chr>                          <chr>    <chr>    <chr>    <chr>      <chr>    
## 1 Employment, all employees  (s… ""       20,001.2 20,044.8 (p)20,046… (p)20,05…
## # A tibble: 1 × 6
##   data_series                     backdata jun_2021 jul_2021 aug_2021  sep_2021 
##   <chr>                           <chr>    <chr>    <chr>    <chr>     <chr>    
## 1 Employment, all employees  (se… ""       2,037.6  2,093.4  (p)2,142… (p)2,185…
## # A tibble: 1 × 6
##   data_series                    backdata jun_2021 jul_2021 aug_2021   sep_2021 
##   <chr>                          <chr>    <chr>    <chr>    <chr>      <chr>    
## 1 Employment, all employees  (s… ""       12,762.9 13,115.3 (p)13,104… (p)13,13…
## # A tibble: 1 × 6
##   data_series                       backdata jun_2021 jul_2021 aug_2021 sep_2021
##   <chr>                             <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 Employment, all employees  (seas… ""       5,665    5,709    (p)5,743 (p)5,727

We now have 19 dataframes, each containing one row each and two columns, one of which is the employment number for a given sector for jun_2021. But we’re missing information about what industry sector these employment numbers represent.

We can add that back in by using bind_cols() to reconnect the each_row_df, which contains the sector code and the sector name.

Task 33: Run code to run for loop to add in data

Task Run the following for loop to add in data. Briefly describe the output that appears below the codeblock. Answer The output shows now shows the data we were missing. We can now see what industry sector these employment numbers represent in our tables.

# For loop, iterating over each row in our naics industry dataframe

for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # Define url of page to get
    url <- each_row_df$sector_url

    # Define id of table to ingest
    xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')

    # Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table().  The dataframe is in a nested list, which we'll have to extract in the next step.
    employment_info <- url %>%
      read_html() %>%
      html_elements(xpath = xpath_employment_table) %>%
      html_table()

    # Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row; use bind_cols() to append the sector code and name to this table.
    employment_info <- employment_info[[1]] %>%
      clean_names() %>%
      slice(2) %>%
      bind_cols(each_row_df)

    # To help us see what's happening as we build this, we're going to print the thing we're creating.  
    print(employment_info)


}
## # A tibble: 1 × 10
##   data_series       backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>             <lgl>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Unemployment rate NA       7.5%     7.6%     5.7%     5.0%     11     Agricultur…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       586.7    592.5    (p)598.3 (p)603.0 21     Mining, Qua…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       538.6    538.9    (p)538.1 (p)537.6 22     Utilities  
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       7,413    7,425    (p)7,425 (p)7,447 23     Constructi…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       12,332   12,389   (p)12,4… (p)12,4… 31-33  Manufactur…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       5,707.1  5,722.1  (p)5,71… (p)5,73… 42     Wholesale …
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       15,361.4 15,355.3 (p)15,3… (p)15,4… 44-45  Retail Tra…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021  sep_2021 sector description
##   <chr>         <chr>    <chr>    <chr>    <chr>     <chr>    <chr>  <chr>      
## 1 Employment, … ""       5,736.8  5,792.9  (p)5,847… (p)5,89… 48-49  Transporta…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       2,726    2,745    (p)2,774 (p)2,806 51     Information
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       6,534.3  6,545.1  (p)6,54… (p)6,54… 52     Finance an…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       2,271.8  2,291.9  (p)2,30… (p)2,30… 53     Real Estate…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       9,791.0  9,839.0  (p)9,89… (p)9,95… 54     Professiona…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       2,329.9  2,329.8  (p)2,33… (p)2,33… 55     Management …
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       8,726.6  8,770.3  (p)8,79… (p)8,79… 56     Administrat…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series    backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description
##   <chr>          <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>      
## 1 Employment, a… ""       3,535.8  3,576.1  (p)3,62… (p)3,60… 61     Educationa…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       20,001.2 20,044.8 (p)20,0… (p)20,0… 62     Health Care…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       2,037.6  2,093.4  (p)2,14… (p)2,18… 71     Arts, Enter…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021  sep_2021 sector description
##   <chr>         <chr>    <chr>    <chr>    <chr>     <chr>    <chr>  <chr>      
## 1 Employment, … ""       12,762.9 13,115.3 (p)13,10… (p)13,1… 72     Accommodat…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>
## # A tibble: 1 × 10
##   data_series   backdata jun_2021 jul_2021 aug_2021 sep_2021 sector description 
##   <chr>         <chr>    <chr>    <chr>    <chr>    <chr>    <chr>  <chr>       
## 1 Employment, … ""       5,665    5,709    (p)5,743 (p)5,727 81     Other Servi…
## # … with 2 more variables: sector_url <chr>, sector_xpath_id <chr>

Then we’ll do a little bit of cleaning.

Let’s use parse_number() to remove the comma from the jun_2021 number and convert it from a character to number. We’ll use rename() to make the jun_2021 column name a little more descriptive. And then we’ll use select() to keep only the columns we want to keep – the sector number, the sector name, and the jun_2021 employment number.

Task 34: Run code to run for loop to clean up tables

Task Run the following for loop to clean up tables. Briefly describe the output that appears below the codeblock. Answer The clean tables are displayed. We have only the columns we wanted to keep.

# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # Define url of page to get
    url <- each_row_df$sector_url

    # Define id of table to ingest
    xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')

    # Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table().  The dataframe is in a nested list, which we'll have to extract in the next step.
    employment_info <- url %>%
      read_html() %>%
      html_elements(xpath = xpath_employment_table) %>%
      html_table()

    # Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row; use bind_cols() to append the sector code and name to this table; turn jun_2021 column into a proper number, and rename it.  Then select only three columns we need.
    employment_info <- employment_info[[1]] %>%
      clean_names() %>%
      slice(2) %>%
      bind_cols(each_row_df) %>%
      mutate(jun_2021 = parse_number(jun_2021)) %>%
      rename(jun_2021_employees = jun_2021) %>%
      select(sector,description,jun_2021_employees)

    # To help us see what's happening as we build this, we're going to print the thing we're creating.  
    print(employment_info)


}
## # A tibble: 1 × 3
##   sector description                                jun_2021_employees
##   <chr>  <chr>                                                   <dbl>
## 1 11     Agriculture, Forestry, Fishing and Hunting                7.5
## # A tibble: 1 × 3
##   sector description                                   jun_2021_employees
##   <chr>  <chr>                                                      <dbl>
## 1 21     Mining, Quarrying, and Oil and Gas Extraction               587.
## # A tibble: 1 × 3
##   sector description jun_2021_employees
##   <chr>  <chr>                    <dbl>
## 1 22     Utilities                 539.
## # A tibble: 1 × 3
##   sector description  jun_2021_employees
##   <chr>  <chr>                     <dbl>
## 1 23     Construction               7413
## # A tibble: 1 × 3
##   sector description   jun_2021_employees
##   <chr>  <chr>                      <dbl>
## 1 31-33  Manufacturing              12332
## # A tibble: 1 × 3
##   sector description     jun_2021_employees
##   <chr>  <chr>                        <dbl>
## 1 42     Wholesale Trade              5707.
## # A tibble: 1 × 3
##   sector description  jun_2021_employees
##   <chr>  <chr>                     <dbl>
## 1 44-45  Retail Trade             15361.
## # A tibble: 1 × 3
##   sector description                    jun_2021_employees
##   <chr>  <chr>                                       <dbl>
## 1 48-49  Transportation and Warehousing              5737.
## # A tibble: 1 × 3
##   sector description jun_2021_employees
##   <chr>  <chr>                    <dbl>
## 1 51     Information               2726
## # A tibble: 1 × 3
##   sector description           jun_2021_employees
##   <chr>  <chr>                              <dbl>
## 1 52     Finance and Insurance              6534.
## # A tibble: 1 × 3
##   sector description                        jun_2021_employees
##   <chr>  <chr>                                           <dbl>
## 1 53     Real Estate and Rental and Leasing              2272.
## # A tibble: 1 × 3
##   sector description                                      jun_2021_employees
##   <chr>  <chr>                                                         <dbl>
## 1 54     Professional, Scientific, and Technical Services               9791
## # A tibble: 1 × 3
##   sector description                             jun_2021_employees
##   <chr>  <chr>                                                <dbl>
## 1 55     Management of Companies and Enterprises              2330.
## # A tibble: 1 × 3
##   sector description                                           jun_2021_employe…
##   <chr>  <chr>                                                             <dbl>
## 1 56     Administrative and Support and Waste Management and …             8727.
## # A tibble: 1 × 3
##   sector description          jun_2021_employees
##   <chr>  <chr>                             <dbl>
## 1 61     Educational Services              3536.
## # A tibble: 1 × 3
##   sector description                       jun_2021_employees
##   <chr>  <chr>                                          <dbl>
## 1 62     Health Care and Social Assistance             20001.
## # A tibble: 1 × 3
##   sector description                         jun_2021_employees
##   <chr>  <chr>                                            <dbl>
## 1 71     Arts, Entertainment, and Recreation              2038.
## # A tibble: 1 × 3
##   sector description                     jun_2021_employees
##   <chr>  <chr>                                        <dbl>
## 1 72     Accommodation and Food Services             12763.
## # A tibble: 1 × 3
##   sector description                                   jun_2021_employees
##   <chr>  <chr>                                                      <dbl>
## 1 81     Other Services (except Public Administration)               5665

We’re getting very close to the finished table we showed at the beginning.

But right now, each bit of sector information is separated between 19 different dataframes.

We want them in one dataframe.

We can fix this by creating an empty dataframe called “employment_by_sector_all” using tibble(), placing it before our “for loop”.

And inside our “for loop” at the end, we’ll bind each employment_info dataframe to the newly created empty dataframe.

Task 35: Run code to run for loop to combine tables into a single table

Task Run the following for loop combine tables into a single table. Briefly describe the output that appears below the codeblock. Answer Now, everything is in one dataframe.

# Create an empty dataframe to hold results
employment_by_sector_all <- tibble()

# For loop, iterating over each row in our naics industry dataframe
for(row_number in 1:nrow(naics_industry)) {

    # Keep only the row for a given row number, get rid of every other row
    each_row_df <- naics_industry %>%
      slice(row_number)

    # Define url of page to get
    url <- each_row_df$sector_url

    # Define id of table to ingest
    xpath_employment_table <- paste0('//*[@id="',each_row_df$sector_xpath_id,'"]')

    # Get employment table from each page by going to each url defined above, reading in the html with read_html(), extracting the table with the id generated by the xpath code using html_elements), and then turning the html into a proper dataframe using html_table().  The dataframe is in a nested list, which we'll have to extract in the next step.
    employment_info <- url %>%
      read_html() %>%
      html_elements(xpath = xpath_employment_table) %>%
      html_table()

    # Grab the dataframe out of the list (it's the first and only element inside the list); clean up the field names with clean_names(); use slice(2) to keep only the second row; use bind_cols() to append the sector code and name to this table; turn jun_2021 column into a proper number, and rename it.  Then select only three columns we need.
    employment_info <- employment_info[[1]] %>%
      clean_names() %>%
      slice(2) %>%
      bind_cols(each_row_df) %>%
      mutate(jun_2021 = parse_number(jun_2021)) %>%
      rename(jun_2021_employees = jun_2021) %>%
      select(sector,description,jun_2021_employees)

    # Bind each individual employment info table to our employment_by_sector_all dataframe
    employment_by_sector_all <- employment_by_sector_all %>%
      bind_rows(employment_info)

}

# Display the completed dataframe
employment_by_sector_all
## # A tibble: 19 × 3
##    sector description                                          jun_2021_employe…
##    <chr>  <chr>                                                            <dbl>
##  1 11     Agriculture, Forestry, Fishing and Hunting                         7.5
##  2 21     Mining, Quarrying, and Oil and Gas Extraction                    587. 
##  3 22     Utilities                                                        539. 
##  4 23     Construction                                                    7413  
##  5 31-33  Manufacturing                                                  12332  
##  6 42     Wholesale Trade                                                 5707. 
##  7 44-45  Retail Trade                                                   15361. 
##  8 48-49  Transportation and Warehousing                                  5737. 
##  9 51     Information                                                     2726  
## 10 52     Finance and Insurance                                           6534. 
## 11 53     Real Estate and Rental and Leasing                              2272. 
## 12 54     Professional, Scientific, and Technical Services                9791  
## 13 55     Management of Companies and Enterprises                         2330. 
## 14 56     Administrative and Support and Waste Management and…            8727. 
## 15 61     Educational Services                                            3536. 
## 16 62     Health Care and Social Assistance                              20001. 
## 17 71     Arts, Entertainment, and Recreation                             2038. 
## 18 72     Accommodation and Food Services                                12763. 
## 19 81     Other Services (except Public Administration)                   5665

Ta da! The end result is a nice tidy dataframe with the number of employees in June 2021 for each sector.

It’s always a good idea to spot check the results, especially any values that look suspiciously high or low.

The value for “Agriculture, Forestry, Fishing and Hunting” seems suspiciously low, compared with the other values.

Let’s figure out why.

Here’s the table on the mining sector page: https://www.bls.gov/iag/tgs/iag21.htm

Task 36: Go to the web page linked above

Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage gives information about the Mining, Quarrying, and Oil and Gas Extraction sector. There is a table on workforce statistics, showing employment/unemployment.

Task 37: Load image

Task Run the following code to display an image showing what you should see on the web page.

And here’s the table for the agriculture sector.

https://www.bls.gov/iag/tgs/iag11.htm

Task 38: Go to the web page linked above

Task Visit the web page linked above. Briefly describe what you see there. Answer The webpage gives information about the Agriculture, Forestry, Fishing and Hunting sector. There is a table on workforce statistics, showing the unemployment rate.

Task 39: Load image

Task Run the following code to display an image showing what you should see on the web page.

Unlike mining – and every other sector page (I checked each page) – the agriculture page is structured differently.

In the second row of this table, it has the unemployment rate. Nowhere on the page can we find information on the number of employees. We would need to do additional research to track down a valid number if we plan on using this table, but for now we’re going to replace it with an NA using na_if.

Task 40: Run code to remove agriculture from table

Task Run the following code to remove agriculture from table. Briefly describe the output that appears below the codeblock. Answer It shows the sector table with an NA under the number of employees.

# remove the suspicious value for agriculture.
employment_by_sector_all <- employment_by_sector_all %>%
  mutate(jun_2021_employees = na_if(jun_2021_employees,7.5))

# display it
employment_by_sector_all
## # A tibble: 19 × 3
##    sector description                                          jun_2021_employe…
##    <chr>  <chr>                                                            <dbl>
##  1 11     Agriculture, Forestry, Fishing and Hunting                         NA 
##  2 21     Mining, Quarrying, and Oil and Gas Extraction                     587.
##  3 22     Utilities                                                         539.
##  4 23     Construction                                                     7413 
##  5 31-33  Manufacturing                                                   12332 
##  6 42     Wholesale Trade                                                  5707.
##  7 44-45  Retail Trade                                                    15361.
##  8 48-49  Transportation and Warehousing                                   5737.
##  9 51     Information                                                      2726 
## 10 52     Finance and Insurance                                            6534.
## 11 53     Real Estate and Rental and Leasing                               2272.
## 12 54     Professional, Scientific, and Technical Services                 9791 
## 13 55     Management of Companies and Enterprises                          2330.
## 14 56     Administrative and Support and Waste Management and…             8727.
## 15 61     Educational Services                                             3536.
## 16 62     Health Care and Social Assistance                               20001.
## 17 71     Arts, Entertainment, and Recreation                              2038.
## 18 72     Accommodation and Food Services                                 12763.
## 19 81     Other Services (except Public Administration)                    5665

And we’re done.

A note about advanced scraping – every site is different. Every time you want to scrape a site, you’ll be puzzling over different problems. But the steps remain the same: find a pattern, exploit it, clean the data on the fly and put it into a place to store it.